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Abstract. The algorithms in the current sequential numerical linear 
algebra libraries (e.g. LAPACK) do not parallelize well on multicore 
architectures. A new family of algorithms, the tile algorithms, has re- 
cently been introduced. Previous research has shown that it is possible 
to write efficient and scalable tile algorithms for performing a Cholesky 
factorization, a (pseudo) LU factorization, and a QR factorization. In 
this extended abstract, we attack the problem of the computation of the 
inverse of a symmetric positive definite matrix. We observe that, using 
a dynamic task scheduler, it is relatively painless to translate existing 
LAPACK code to obtain a ready-to-be-executed tile algorithm. However 
we demonstrate that non trivial compiler techniques (array renaming, 
loop reversal and pipelining) need then to be applied to further increase 
the parallelism of our application. We present preliminary experimental 
results. 



1 Introduction 

The appropriate direct method to compute tbe solution of a symmetric positive 
definite system of linear equations consists of computing the Cholesky factoriza- 
tion of that matrix and then solving the underlying triangular systems. It is not 
recommended to use the inverse of the matrix in this case. However some appli- 
cations need to explicitly form the inverse of the matrix. A canonical example 
is the computation of the variance-covariance matrix in statistics. Higham [12, 
p. 260, §3] lists more such applications. 

With their advent, multicore architectures [18] induce the need for algorithms 
and libraries that fully exploit their capacities. A class of such algorithms - 
called tile algorithms [8,9] - has been developed for one-sided dense factoriza- 
tions (Cholesky, LU and QR) and made available as part of the Parallel Linear 
Algebra Software for Multicore Architectures (PLASMA) library [2] . In this pa- 
per, we extend this class of algorithms to the case of the (symmetric positive 



definite) matrix inversion. Besides constituting an important functionality for 
a library such as PLASMA, the study of the matrix inversion on multicore ar- 
chitectures represents a challenging algorithmic problem. Indeed, first, contrary 
to standalone one-sided factorizations that have been studied so far, the ma- 
trix inversion exhibits many anti-dependences [4] (Write After Read). Those 
anti-dependences can be a bottleneck for parallel processing, which is critical on 
multicore architectures. It is thus essential to investigate (and adapt) well known 
techniques used in compilation such as using temporary copies of data to remove 
anti-dependences to enhance the degree of parallelism of the matrix inversion. 
This technique is known as array renaming [4] (or array privatization [11]). Sec- 
ond, loop reversal [4] is to be investigated. Third, the matrix inversion consists of 
three successive steps (first of which is the Cholesky decomposition). In terms of 
scheduling, it thus represents an opportunity to study the effects of pipelining [4] 
those steps on performance. 

The current version of PLASMA (version 2.1) is scheduled statically. Ini- 
tially developed for the IBM Cell processor [14], this static scheduling relies on 
POSIX threads and simple synchronization mechanisms. It has been designed to 
maximize data reuse and load balancing between cores, allowing for very high 
performance [3] on today's multicore architectures. However, in the case of the 
matrix inversion, the design of an ad-hoc static scheduling is a time consuming 
task and raises load balancing issues that are much more difficult to address 
than for a stand-alone Cholesky decomposition, in particular when dealing with 
the pipelining of multiple steps. Furthermore, the growth of the number of cores 
and the more complex memory hierarchies make executions less deterministic. 
In this paper, we rely on an experimental in-house dynamic scheduler [13]. This 
scheduler is based on the idea of expressing an algorithm through its sequential 
representation and unfolding it at runtime using data hazards (Read after Write, 
Write after Read, Write after Write) as constraints for parallel scheduling. The 
concept is rather old and has been validated by a few successful projects. We 
could have as well used schedulers from the Jade project from Stanford Univer- 
sity [17] or from the SMPSs project from the Barcelona Supercomputer Center 
[15]. 

Our discussions are illustrated with experiments conducted on a dual-socket 
quad-core machine based on an Intel Xeon EMT64 processor operating at 2.26 
GHz. The theoretical peak is equal to 9.0 Gflop/s per core or 72.3 Gflop/s for 
the whole machine, composed of 8 cores. The machine is running Mac OS X 
10.6.2 and is shipped with the Apple vecLib vl26.0 multithreaded BLAS [1] and 
LAPACK vendor library, as well as LAPACK [5] v3.2.1 and ScaLAPACK [7] 
vl.8.0 references. 

The rest of the paper is organized as follows. In Section 2, we present a 
new algorithm for matrix inversion based on tile algorithms; we explain how we 
articulated it with our dynamic scheduler to take advantage of multicore archi- 
tectures and we compare its performance against state-of-the-art libraries. In 
Section 3, we investigate the impact on parallelism and performance of different 



well known techniques used in compilation: loop reversal, array renaming and 
pipelining. We conclude and present future work directions in Section 4. 

2 Tile in-place matrix inversion 

Tile algorithms are a class of Linear Algebra algorithms that allow for fine gran- 
ularity parallelism and asynchronous scheduling, enabling high performance on 
multicore architectures [3, 8, 9, 16]. The matrix of order n is split into txt square 
submatrices of order b (n = b X t) . Such a submatrix is of small granularity (we 
fixed b — 200 in this paper) and is called a tile. So far, tile algorithms have been 
essentially used to implement one-sided factorizations [3,8,9, 16]. 

Algorithm 1 extends this class of algorithms to the case of the matrix in- 
version. As in state-of-the-art libraries (LAPACK, ScaLAPACK), the matrix 
inversion is performed in-place, i.e., the data structure initially containing ma- 
trix A is directly updated as the algorithm is progressing, without using any 
significant temporary extra-storage; eventually, A~ x substitutes A. Algorithm 1 
is composed of three steps. Step 1 is a Tile Cholesky Factorization computing 
the Cholesky factor L (lower triangular matrix satisfying A = LL T ). This step 
was studied in [9]. Step 2 computes L^ 1 by inverting L. Step 3 finally computes 
the inverse matrix A^ 1 = L~ lT L~ l . Each step is composed of multiple fine 
granularity tasks (since operating on tiles). These tasks are part of the BLAS 
(SYRK, GEMM, TRSM, TRMM) and LAPACK (POTRF, TRTRI, LAUUM) 
standards. A more detailed description is beyond the scope of this extended ab- 
stract and is not essential to the understanding of the rest of the paper. Indeed, 
from a high level point of view, an operation based on tile algorithms can be 
represented as a Directed Acyclic Graphs (DAG) [10] where nodes represent the 
fine granularity tasks in which the operation can be decomposed and the edges 
represent the dependences among them. For instance, Figure 1(a) represents 
the DAG of Step 3 of Algorithm 1. 




(a) In-place (Algo- (b) Out-of-place (variant introduced in Section 3) 

rithm 1) 



Fig. 1. DAGs of Step 3 of the Tile Cholesky Inversion (t = 4). 



Algorithm 1: Tile In-place Cholesky Inversion (lower format). Matrix A 
is the on-going updated matrix (in-place algorithm). 

Input: A, Symmetric Positive Definite matrix in tile storage (t x t tiles). 
Result: A^ 1 , stored in-place in A. 

Step 1: Tile Cholesky Factorization (compute L such that A — LL T ); 
for j — to t — 1 do 
for k = to j — 1 do 

L A i,i <- A jd - A jik * Aj ik (SYRK(j,k)) ; 
Am <- CHOL(Ajj) (POTRF(j)) ; 
for i = j + 1 to t — 1 do 
for k = to j — 1 do 
|_ Aij «- Aij - A itk * A\ k (GEMM(i,j,k)) ; 

for i = j ' + 1 to i — 1 do 

L «- Aij/Ajj (TRSM(ij)) ; 

Step 2: Tile Triangular Inversion of L (compute L^ 1 ); 
for j = t — 1 to do 

Aj,j <- TRJNV{A jd ) (TRTRI(j)) ; 
for i — t — ltoj + 1 do 

«- * (TRMM(iJ)) ; 
for k = j + ltoi — 1 do 
|_ Aij «- + Ai, fc * A fcj (GEMM(i,j,k)) ; 

|_ Ai,j <- -Aij * A v (TRMM(ij)) ; 

Step 3: T«Ze Product of Lower Triangular Matrices (compute A^ 1 = I/ _lT L _1 / ); 
for i = to £ — 1 do 
for j = to i — 1 do 
L Aij «- ^ * Aij (TRMM(ij)) ; 

Ai,i <- * A M (LAUUM(i)) ; 
for j = to i — 1 do 

for = i + 1 to f - 1 do 

|_ Aij «- Aij + A k i * A kJ (GEMM(i,j,k)) ; 

for k = i + ltot — 1 do 

L «- + Al ti * A kti (SYRK(i,k)) ; 



Algorithm 1 is based on the variants used in LAPACK 3.2.1. Bicntinesi, 
Gunter and van de Geijn [6] discuss the merit of algorithmic variations in the 
case of the computation of the inverse of a symmetric positive definite matrix. 
Although of definite interest, this is not the focus of this extended abstract. 

We have implemented Algorithm 1 using our dynamic scheduler introduced 
in Section 1. Figure 2 shows its performance against state-of-the-art libraries and 
the vendor library on the machine described in Section 1. For a matrix of small 
size, it is difficult to extract parallelism and have a full use of all the cores [3, 8, 9, 
16]. We indeed observe a limited scalability (N = 1000, Figure 2(a)). However, 
tile algorithms (Algorithm 1) still benefit from a higher degree of parallelism 
than blocked algorithms [3,8,9,16]. Therefore Algorithm 1 (in place) consis- 
tently achieves a significantly better performance than vecLib, ScaLAPACK and 
LAPACK libraries. A larger matrix size (N = 4000, Figure 2(b)) allows for a 
better use of parallelism. In this case, an optimized implementation of a blocked 
algorithm (vecLib) competes well against tile algorithms (in place) on few cores 
(left part of Figure 2(a)). However, only tile algorithms scale to a larger number 
of cores (rightmost part of Figure 2(b)) thanks to a higher degree of parallelism. 
In other words, the tile Cholesky inversion achieves a better strong scalability 
than the blocked versions, similarly to what had been observed for the factor- 
ization step [3, 8, 9, 16]. 




Fig. 2. Scalability of Algorithm 1 (in place) and its out-of-place variant introduced 
in Section 3, using our dynamic scheduler against vecLib, ScaLAPACK and LAPACK 
libraries. 



3 Algorithmic study 



In the previous section, we compared the performance of the tile Cholesky in- 
version against state-the-art libraries. In this section, we focus on tile Cholesky 
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Table 1. Length of the critical path as a function of the number of tiles t. 



inversion and we discuss the impact of several variants of Algorithm 1 on per- 
formance. 

Array renaming (removing anti-dependences). The dependence be- 
tween SYRK(0,1) and TRMM(1,0) in the DAG of Step 3 of Algorithm 1 (Fig- 
ure 1(a)) represents the constraint that the SYRK operation (1. 28 of Algo- 
rithm 1) needs to read Ak^i = A\fi before TRMM (1. 22) can overwrite Aij = 
A\fi. This anti-dependence (Write After Read) can be removed thanks to a tem- 
porary copy of A\fi. Similarly, all the SYRK-TRMM anti-dependences, as well 
as TRMM-LAUMM and GEMM-TRMM anti-dependences can be removed. We 
have designed a variant of Algorithm 1 that removes all the anti-dependences 
thanks to the use of a large working array (this technique is called array renam- 
ing [4] in compilation [4]). The subsequent DAG (Figure 1(b)) is split in multiple 
pieces (Figure 1(b)), leading to a shorter critical path (Table 1). We implemented 
the out-of-place algorithm, based on our dynamic scheduler too. Figure 2(a) 
shows that our dynamic scheduler exploits its higher degree of parallelism to 
achieve a much higher strong scalability even on small matrices (N — 1000). 
For a larger matrix (Figure 2(b)), the in-place algorithm already achieved very 
good scalability. Therefore, using up to 7 cores, their performance are similar. 
However, there is not enough parallelism with a 4000 x 4000 matrix to use ef- 
ficiently all 8 cores with the in-place algorithm; thus the higher performance of 
the out-of-place version in this case (leftmost part of Figure 2(b)). 

Loop reversal (exploiting commutativity). The most internal loop of 
each step of Algorithm 1 (1. 88, 1. 17 and 1. 26) consists in successive commuta- 
tive GEMM operations. Therefore they can be performed in any order, among 
which increasing order and decreasing order of the loop index. Their ordering 
impacts the length of the critical path. Algorithm 1 orders those three loops in 
increasing (U) and decreasing (D) order, respectively. We had manually chosen 
these respective orders (UDU) because they minimize the critical path of each 
step (values reported in Table 1). A naive approach would have, for example, 
been comprised of consistently ordering the loops in increasing order (UUU) . In 
this case (UUU), the critical path of TRTRI would have been equal to t 2 — 2t + 3 
(in-place) or (^t 2 — + 2) (out-of-place) instead of 3t — 3 (in-place) or 2t — 1 
(out-of-place) for (UDU). Figure 3 shows how loop reversal impacts performance. 

Pipelining. Pipelining the multiple steps of the inversion reduces the length 
of its critical path. For the in-place case, the critical path is reduced from 9t — 7 
tasks (t is the number of tiles) to 9t — 9 tasks (negligible) . For the out-of-place 
case, it is reduced from 6t — 3 to 5t — 2 tasks. We studied the effect of pipelining 
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Fig. 3. Impact of loop reversal on performance. 



on the performance of the inversion on a 8000 x 8000 matrix with an artificially 
large tile size (b — 2000 and t = 4). As expected, we observed almost no effect on 
performance of the in-place case (about 36.4 seconds with or without pipelining). 
For the out-of-place case, the elapsed time grows from 25.1 to 29.2 seconds (16 % 
overhead) when pipelining is prevented. 



4 Conclusion and future work 

We have proposed a new algorithm to compute the inverse of a symmetric posi- 
tive definite matrix on multicore architectures. An experimental study has shown 
both an excellent scalability of our algorithm and a significant performance im- 
provement compared to state-of-the-art libraries. Beyond extending the class of 
so-called tile algorithms, this study brought back to the fore well known issues 
in the domain of compilation. Indeed, we have shown the importance of loop 
reversal, array renaming and pipelining. 

The use of a dynamic scheduler allowed an out-of-the-box pipeline of the dif- 
ferent steps whereas loop reversal and array renaming required a manual change 
to the algorithm. The future work directions consist in enabling the scheduler 
to perform itself loop reversal and array renaming. We exploited the commu- 
tativity of GEMM operations to perform array renaming. Their associativity 
would furthermore allow to process them in parallel (following a binary tree); 
the subsequent impact on performance is to be studied. Array renaming requires 
extra-memory. It will be interesting to address the problem of the maximization 
of performance under memory constraint. This work aims to be incorporated 
into PLASMA. 
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